I've been trying to put together a parser/scanner toolkit in C++ that emphasises ease of use, rather than optimality. BINDING: To start parsing a file, either specify it when creating the Parse object, or open it. Parse P("file"); Parse P; P.open("file"); The number of lines parsed so far can be obtained by the call cerr << "line #" << P.line(); MATCHING: Every ``match'' of the parser declares a string which the Parse object, P, should try to match in the file being parsed. While it succeeds (int) P is 1. If at some point it fails, (int) P is 0. There are mechanisms for specifying conjunction (and--represented by juxtaposition of ``matches'') and disjunction (or--represented by an OR. This will also force backtracking). There are primitive, pre-defined ``matches''. Finally, there are user-defined ``matches''. For the following section assume that the variable P has been declared as Parse P; The Primitive Matches: The following two matches specify precise matches -- match character: matches specified character P , 'x'; -- match string: matches specified string, first skipping any white-space. P , "alpha"; -- match end of file: self explanatory P , EOFL; The rest of the primitives match some string, and return a Token. A Token is basically a (possibly not null terminated) string and length pair. Assume that there is the following declaration: Token t; Further, all but match character skip all leading white-space. -- match number: matches string [-+]?[0-9]+ P , NUMBER(t); -- match string: matches "([^\\\n"]|(\.))*" P , STRING(t); -- match ident: matches [_a-zA-Z][_a-zA-Z0-9]* P , IDENT(t); -- match token: matches a string or ident, except that it returns a string without the quotes. P , TOKEN(t); -- match character: matches next character (including whitespace) P , CHARAC(t); And'ing Matches: Simply writing any number of matches next to each other (with ,s in between) is the same as and'ing. This is written as: P, match1, match2 ,..., matchN; The Parse object will try to find a match only if all preceding matches have succeeded. The Parse will match the and'ed matches if and only if it matches all of the matches. Token id, val; P, IDENT(id), "=", NUMBER(val); This example will match a string like 'X12 = -12' Or'ing Matches: Trying to match one of two or more match sequences is written as P, and1, OR, and2, OR, ..., OR, andN ; The Parse object will try to match an and sequence if and only if it has failed all preceding and sequences. It begins each and sequence at the same point. Token id, val; P, IDENT(id), "=", NUMBER(val), OR, IDENT(id), ":", STRING(val) ; This example will match string 'Xa: "alpha"' Of course, this would not be very useful just as it is, since it would be difficult to find which and was actually matched. So there is a primitive CHOOSE(N) (N is an integer) which always succeeds. If the Parse object tries to match with a CHOOSE, it always succeeds, and saves N. The value stored by the last CHOOSE of a (successful) Parse object P can be retrieved by the member function (). If the Parse was unsuccessful (i.e. if none of the ands matched) it returns 0. Token id, val; P, IDENT(id), "=", NUMBER(val), CHOOSE(1), OR, IDENT(id), ":", STRING(val), CHOOSE(2) ; switch(P()) { case 0: { cerr << "no match" << endl; break; } case 1: { cerr << "number match" << endl; break; } case 2: { cerr << "string match" << endl; break; } } User-defined Matches: The real power of this system comes from the ability of the user to define functions that can been be used to do more powerful matches. A user-defined match function has one of the following signatures: int func(Parse&); int func(Parse&, T&); int func(Parse&, S&, T&); It returns a 0 if the match fails. S and T are just parameters used to return values. Consider a function that matches an identifier only if it has been defined before. Also, if it does find the identifier, it returns the pointer to its symbol table entry. int name(Parse& P, StbEntry *& stb) { Token t; P, IDENT(t); if( P && (stb = StbFind(t.string(), t.length())) != 0 ) { return 1; } return 0; } These functions are invoked as follows: P , MATCH(func); P , MATCH(func, var); P , MATCH(func, var, var); So, the name function can be used as follows: Token t; StbEntry * var; P, MATCH(name, var), "=", NUMBER(val), CHOOSE(1), OR, IDENT(t), "=", NUMBER(val), CHOOSE(2); Notice that the first and is matched _ONLY_ if name is matched (i.e. returns 1), which means that it is matched only if the ident has been entered in the symbol table earlier. Some Common Idioms: If we want zero or more of a particular match, we use recursion: int foostar(Parse& P) { P , MATCH(foo) , MATCH(foostar), OR /* nothing */; return P; } To read an entire line int getline(Parse& P) { Token t; do { P, CHAR(t); } while(t[0] != '\n' ); } BACKTRACKING: Backtracking takes place every time an OR is encountered, and no prior and sequence has been matched. Parsing resumes at the point where the input was on entry to the function. This is why matching a list of items requires recursion, and not iteration. The number of lines parsed is reset on a backtrack.